SmartCrawl: Deep Web Crawling Driven By Data Enrichment
Authors
Abstract
Entity resolution is defined as finding different records that refer to the same real-world entity. In this paper, we study deep entity resolution (DeepER), which aims to find pairs of records that describe the same entity between a local database and a hidden database. The local database can be accessed freely, but the hidden database can only be accessed through a keyword-search query interface. To the best of our knowledge, we are the first to study this problem. We first show that straightforward solutions are inefficient because they fail to exploit the ideas of query sharing and local-database-aware crawling. In response, we propose SMARTCRAWL, a novel framework that overcomes these limitations. Given a budget of b queries, SMARTCRAWL first constructs a query pool based on the local database and then iteratively issues b queries to the hidden database such that the union of the query results covers the maximum number of records in the local database. Finally, it performs entity resolution between the local database and the crawled records. We find that query selection is the most challenging aspect, and we investigate how to select the query with the largest benefit at each iteration. SMARTCRAWL uses a sample of the hidden database to estimate the query benefit. We propose unbiased estimators as well as biased estimators (with small biases) to achieve this goal, and devise efficient algorithms to implement them. We find that (1) biased estimators are much more effective than unbiased estimators, especially when the sample is small (e.g., 0.1%); (2) SMARTCRAWL is more robust to data errors than straightforward solutions. Experimental results over simulated and real hidden databases show that SMARTCRAWL can cover a large portion of the local database with a small budget, outperforming straightforward solutions by a factor of 2 to 7× in a wide variety of situations.
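The crawl loop described in the abstract (build a query pool from the local database, then greedily spend a budget of b queries) can be illustrated with a minimal sketch. This is not the authors' implementation. All function names are hypothetical, the hidden database is simulated by an in-memory list with a top-k cutoff, the benefit of a query is approximated by the number of still-uncovered local records that contain its keywords (the paper instead estimates benefit from a hidden-database sample), and exact string equality stands in for the final entity resolution step.

```python
from itertools import combinations


def tokenize(record):
    """Lowercase a record and split it into a set of keywords."""
    return frozenset(record.lower().split())


def matches(query, record):
    """A record matches a query if it contains every query keyword."""
    return set(query) <= tokenize(record)


def build_query_pool(local_db, max_terms=2):
    """Candidate queries: all keyword combinations (up to max_terms) per local record."""
    pool = set()
    for record in local_db:
        terms = sorted(tokenize(record))
        for n in range(1, max_terms + 1):
            pool.update(combinations(terms, n))
    return pool


def make_search(hidden_db, k):
    """Simulated keyword-search interface that returns at most k matching records."""
    def search(query):
        return [r for r in hidden_db if matches(query, r)][:k]
    return search


def greedy_crawl(local_db, search, budget):
    """Issue up to `budget` queries, each time picking the query with the largest
    assumed benefit (uncovered local records matching the query)."""
    pool = build_query_pool(local_db)
    uncovered = set(local_db)
    issued = []
    while pool and uncovered and len(issued) < budget:
        # sorted() makes tie-breaking deterministic; max() keeps the first maximum.
        best = max(sorted(pool), key=lambda q: sum(matches(q, r) for r in uncovered))
        pool.discard(best)
        issued.append(best)
        # Assumption: exact equality stands in for entity resolution.
        uncovered -= set(search(best))
    return set(local_db) - uncovered, issued


local_db = ["thai noodle house", "thai basil", "noodle bar"]
hidden_db = local_db + ["pizza place"]
covered, issued = greedy_crawl(local_db, make_search(hidden_db, k=10), budget=2)
```

In this toy setup the first query is shared by two local records, which is the query-sharing idea the abstract refers to: one well-chosen query covers several local records instead of one query per record.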
Similar resources
Progressive Deep Web Crawling Through Keyword Queries For Data Enrichment
Data enrichment is the act of extending a local database with new attributes from external data sources. In this paper, we study a novel problem: how to progressively crawl the deep web (i.e., a hidden database) through a keyword-search interface for data enrichment. This is challenging because these interfaces often enforce a top-k constraint, or they have limits on the number of queries that ca...
Deeper: A Data Enrichment System Powered by Deep Web
Data scientists often spend more than 80% of their time on data preparation. Data enrichment, the act of extending a local database with new attributes from external data sources, is among the most time-consuming tasks. Existing data enrichment works are resource intensive: data-intensive by relying on web tables or knowledge bases, monetarily-intensive by purchasing entire datasets, or timeint...
A Structure-Driven Yield-Aware Web Form Crawler: Building a Database of Online Databases
The Web has been rapidly “deepened” by massive databases online: Recent surveys show that while the surface Web has linked billions of static HTML pages, a far more significant amount of information is “hidden” in the deep Web, behind the query forms of searchable databases. With its myriad databases and hidden content, this deep Web is an important frontier for information search. In this pape...
Intelligent Web Crawling
Web crawling, the process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents, from a simple program for website backup to a major web search engine. Due to an astronomical amount of data already published on the Web and ongoing exponential growth of web content, any party that wants to take advantage of m...
Crawling Deep Web Using a New Set Covering Algorithm
Crawling the deep web often requires the selection of an appropriate set of queries so that they can cover most of the documents in the data source with low cost. This can be modeled as a set covering problem which has been extensively studied. The conventional set covering algorithms, however, do not work well when applied to deep web crawling due to various special features of this applicatio...